The objective of this project is to understand the different elements that contribute to the quality of white wine. By breaking down the different elements (features) and analyzing their relationships, we want to understand how much each feature affects the quality of white wine.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
This data set contains 4,898 white wines and 13 variables: a row index (X), 11 chemical properties of each wine, and a quality score. The quality of each wine is scored between 0 (very bad) and 10 (excellent).
Other observations: The median quality of white wine is 6.00.
The mean quality is 5.88.
The maximum quality is 9.00, and at least 75% of the white wines have a quality of 6.00 or less.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The summary shows the distribution of every feature. Interesting observations include that the middle 50% of quality scores fall between 5 and 6 (the first and third quartiles), and that the mean alcohol level is 10.51%.
Let’s take a quick look at the distribution plots of all the features by using grid.arrange.
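The plotting code is not shown in this report. As a rough sketch of the idea, the base-R version below draws one histogram per feature in a single grid. It uses `par(mfrow = ...)` in place of `grid.arrange` (from the gridExtra package) so that it runs without extra packages. The helper name `plot_feature_histograms` is hypothetical, and the small `demo` data frame holds only the six rows printed by `head()` above, so the panels illustrate the layout rather than reproduce the original plots.

```r
# Sketch: one histogram per numeric feature, laid out in a grid (base R).
# `plot_feature_histograms` is a hypothetical helper, not from the original analysis.
plot_feature_histograms <- function(df, skip = "X", ncol = 4) {
  # Keep numeric columns, dropping the row index
  features <- setdiff(names(df)[sapply(df, is.numeric)], skip)
  nrow <- ceiling(length(features) / ncol)
  old_par <- par(mfrow = c(nrow, ncol), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))
  for (f in features) {
    hist(df[[f]], main = f, xlab = f, col = "grey80", border = "white")
  }
  invisible(features)
}

# Six rows transcribed from the head() output; the full data set has 4,898 rows.
demo <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  chlorides = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  sulphates = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality = c(6, 6, 6, 6, 6, 6)
)

pdf(NULL)  # draw to a null device; drop this line to see the plots on screen
shown <- plot_feature_histograms(demo)
invisible(dev.off())
```

With the full data frame in place of `demo`, the same call would give one panel for each of the 12 non-index variables.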
The main feature in the data set is quality.
The other chemical properties (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol) might contribute to the quality of white wine.
Let’s take a look at the descriptions of these properties:
fixed acidity - has a direct influence on the color, balance, and taste of the wine
volatile acidity - aka wine fault, an unpleasant characteristic of a wine resulting from poor winemaking practices or storage conditions, and leading to wine spoilage
citric acid - a weak organic tribasic acid
residual sugar - influences how sweet a wine will taste, measured in grams of sugar per litre of wine
chlorides -
free sulfur dioxide - serves as an antibiotic and antioxidant, protecting wine from spoilage by bacteria and oxidation. It helps minimize volatile acidity
total sulfur dioxide - refers to both free and bound SO2
density - proportional to the sugar content, and expected to fall as the sugar is converted into alcohol by fermentation
pH - strength of acidity
sulphates - added as a preservative to prevent spoilage and oxidation at several stages of winemaking. Without sulfites, grape juice would quickly turn to vinegar
alcohol - amount of alcohol
wine quality
And let’s look at the “quality” variable specifically
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## int [1:4898] 6 6 6 6 6 6 6 6 6 6 ...
The above bar chart shows the distribution of white wine quality. A quality of 6 is the most common score (2,198 wines), and about 93% of the white wines have a quality between 5 and 7.
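The frequency table above is enough to reproduce the summary figures for quality. The sketch below rebuilds the quality vector from those counts; the variable names are only illustrative.

```r
# Rebuild the quality vector from the frequency table printed above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

n_wines <- length(quality)       # total number of wines
summary(quality)                 # min, quartiles, mean, max

# Most frequent score, and the share of wines in selected ranges
mode_score   <- as.integer(names(counts)[which.max(counts)])
share_5_to_7 <- mean(quality >= 5 & quality <= 7)
share_le_6   <- mean(quality <= 6)
round(100 * c(share_5_to_7 = share_5_to_7, share_le_6 = share_le_6), 1)
```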
I transformed the ‘quality’ variable into a categorical ‘rating’ by separating the wines into groups based on their quality values. This gives a clearer scatterplot of the relationship between alcohol level and quality. See the plots in a later section.
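The cut points used for ‘rating’ are not stated in this report. As a minimal sketch, assuming three groups (quality 3 to 4 as "bad", 5 to 6 as "good", 7 to 9 as "great"), the grouping could be done with `cut()`. Both the break points and the labels here are assumptions chosen for illustration.

```r
# Quality scores rebuilt from the frequency table shown earlier in the report.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# ASSUMED cut points (not given in the report):
# (0, 4] -> "bad", (4, 6] -> "good", (6, 10] -> "great"
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "good", "great"))

rating_counts <- table(rating)
rating_counts
```

Different break points would change the group sizes, so the colors in the later scatterplots depend on this choice.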
Let’s take a look at the distributions of the acid variables.
Based on the charts, the middle 50% of fixed acidity values fall between 6.3 and 7.3 (the first and third quartiles). Likewise, citric acid falls between 0.27 and 0.39, volatile acidity between 0.21 and 0.32, and pH between 3.09 and 3.28.
Fixed acidity, citric acid and pH appear to be roughly normally distributed. Volatile acidity is the exception, with a right skew.
Let’s do a log transformation for volatile acidity:
After the log transformation, volatile acidity looks approximately normally distributed.
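To illustrate why a log transformation helps with right-skewed features, the sketch below compares the skewness of a variable before and after taking `log10`. The values are simulated from a log-normal distribution and are not the wine measurements; the base of the logarithm is also an assumption, although it does not affect the skewness.

```r
set.seed(42)
# Simulated right-skewed values (log-normal); NOT the wine measurements.
x <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

# Sample skewness: third standardized moment (near 0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)          # clearly positive: long right tail
skew_log <- skewness(log10(x))   # close to zero after the transformation
round(c(raw = skew_raw, log10 = skew_log), 2)
```

A skewness near zero after the transformation is consistent with the histogram looking more symmetric, though it is not by itself a test of normality.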
Chlorides - middle 50% falls between 0.036 and 0.050. Right-skewed
Residual sugar - middle 50% falls between 1.7 and 9.9. Right-skewed
Density - middle 50% falls between 0.9917 and 0.9961
Alcohol - middle 50% falls between 9.5 and 11.4
Let’s do a log transformation for chlorides and residual sugar.
Chlorides after the log transformation now look more normal.
Let’s examine the acid variables, starting with fixed acid vs citric acid:
##
## Pearson's product-moment correlation
##
## data: wwine$fixed.acidity and wwine$citric.acid
## t = 21.137, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2633067 0.3146389
## sample estimates:
## cor
## 0.2891807
Fixed acidity and citric acid have a positive correlation of 0.289.
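The t statistic and confidence interval in the output above follow directly from the correlation and the sample size. The sketch below recomputes them from the reported values, using the standard t formula for a Pearson correlation and the Fisher z-transformation for the interval.

```r
r  <- 0.2891807   # sample correlation reported by cor.test
n  <- 4898        # number of wines
df <- n - 2       # degrees of freedom

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z          <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci         <- tanh(c(z - half_width, z + half_width))

round(t_stat, 3)
round(ci, 4)
```

With almost 4,900 observations even a weak correlation gives a large t statistic, so the very small p-value says the correlation is unlikely to be zero, not that it is strong.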
Volatile acid vs citric acid:
##
## Pearson's product-moment correlation
##
## data: wwine$volatile.acidity and wwine$citric.acid
## t = -9.3688, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1601217 -0.1050945
## sample estimates:
## cor
## -0.1327103
Volatile acidity and citric acid have a weak negative correlation of -0.133.
free.sulfur.dioxide vs total.sulfur.dioxide
##
## Pearson's product-moment correlation
##
## data: wwine$free.sulfur.dioxide and wwine$total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
Free sulfur dioxide and total sulfur dioxide have a positive correlation of 0.616.
fixed acidity vs quality
##
## Pearson's product-moment correlation
##
## data: wwine$fixed.acidity and wwine$quality
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14121974 -0.08592991
## sample estimates:
## cor
## -0.1136628
Fixed acidity and quality have a weak negative correlation of -0.114.
Examine the correlation coefficients of the different features against quality:
##
## Pearson's product-moment correlation
##
## data: wwine$fixed.acidity and as.numeric(wwine$quality)
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14121974 -0.08592991
## sample estimates:
## cor
## -0.1136628
##
## Pearson's product-moment correlation
##
## data: wwine$volatile.acidity and as.numeric(wwine$quality)
## t = -14.087, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2241308 -0.1702981
## sample estimates:
## cor
## -0.1973632
##
## Pearson's product-moment correlation
##
## data: wwine$citric.acid and as.numeric(wwine$quality)
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03720595 0.01880221
## sample estimates:
## cor
## -0.009209091
##
## Pearson's product-moment correlation
##
## data: wwine$residual.sugar and as.numeric(wwine$quality)
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
##
## Pearson's product-moment correlation
##
## data: wwine$chlorides and as.numeric(wwine$quality)
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
##
## Pearson's product-moment correlation
##
## data: wwine$free.sulfur.dioxide and as.numeric(wwine$quality)
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01985292 0.03615626
## sample estimates:
## cor
## 0.008158067
##
## Pearson's product-moment correlation
##
## data: wwine$total.sulfur.dioxide and as.numeric(wwine$quality)
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
##
## Pearson's product-moment correlation
##
## data: wwine$density and as.numeric(wwine$quality)
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
##
## Pearson's product-moment correlation
##
## data: wwine$pH and as.numeric(wwine$quality)
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07162022 0.12707983
## sample estimates:
## cor
## 0.09942725
##
## Pearson's product-moment correlation
##
## data: wwine$sulphates and as.numeric(wwine$quality)
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02571007 0.08156172
## sample estimates:
## cor
## 0.05367788
##
## Pearson's product-moment correlation
##
## data: wwine$alcohol and as.numeric(wwine$quality)
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
Analysis:
Positive relationship with quality - free sulfur dioxide, pH, sulphates, alcohol
Negative relationship with quality - fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide, density
Note that the correlations for citric acid (p = 0.52) and free sulfur dioxide (p = 0.57) are not statistically distinguishable from zero.
The strongest relationship was between quality and alcohol, for which cor.test returns 0.436.
All of the features were measured against quality, with “alcohol” having the largest coefficient (0.436). Fixed acidity (-0.114), volatile acidity (-0.197), citric acid (-0.009), residual sugar (-0.098), chlorides (-0.210), free sulfur dioxide (0.008), total sulfur dioxide (-0.175), density (-0.307), pH (0.099) and sulphates (0.054) all have relatively weak correlations with quality.
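To compare the features at a glance, the coefficients from the cor.test output above can be collected into one vector and ranked by absolute value. The sketch below does only that; squaring each coefficient gives the share of variance in quality that a single feature accounts for in a simple linear fit.

```r
# Pearson correlations with quality, copied from the cor.test output above.
cor_with_quality <- c(
  fixed.acidity        = -0.1136628,
  volatile.acidity     = -0.1973632,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.09757683,
  chlorides            = -0.2099344,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.1747372,
  density              = -0.3071233,
  pH                   =  0.09942725,
  sulphates            =  0.05367788,
  alcohol              =  0.4355747
)

# Rank by strength of association, ignoring the sign
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
data.frame(r = round(ranked, 3), r_squared = round(ranked^2, 3))
```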
Based on the description of the features, a negative coefficient for volatile acidity (wine fault) makes sense: the more volatile acidity, the worse the wine quality. This is expected; however, I was surprised that it’s only -0.197 and thought the relationship would be stronger.
The correlation coefficient between ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’ is 0.616. This is somewhat to be expected since one is a subset of the other. Upon further review, the formula is the following:
total sulfur dioxide = free sulfur dioxide + bound sulfur dioxide
Sulfur dioxide is used as a preservative because of its anti-oxidative and anti-microbial properties, and also as a cleaning agent for barrels and winery facilities.
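The relation total = free + bound can be used to derive the bound portion, which is not a column in the data set. A minimal sketch using the six rows printed by `head()` earlier in the report (the name `bound_so2` is illustrative):

```r
# Values transcribed from the head() output (first six wines).
free_so2  <- c(45, 14, 30, 47, 47, 30)
total_so2 <- c(170, 132, 97, 186, 186, 97)

# Free SO2 is part of total SO2, so it can never exceed it
stopifnot(all(free_so2 <= total_so2))

# Bound SO2 is whatever remains after removing the free portion
bound_so2 <- total_so2 - free_so2
bound_so2
```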
Other relationships were explored as well, including volatile acidity and citric acid (cor.test = -0.133) and fixed acidity and citric acid (cor.test = 0.289).
In the following plots we’ll examine the relationship between quality and other features. But instead of using quality, we’ll use our new variable ‘rating’.
Impact of alcohol and pH on White Wine Rating:
We can see a trend between alcohol content and rating: the higher the alcohol content, the higher the rating.
Many of the bad ratings (red) appear on the left side, at low alcohol content. Toward the middle of the chart (alcohol content between 10% and 12%), there are many green ratings. The great ones (blue) appear on the right side, where the alcohol content is over 12%.
Let’s switch variable from pH to density
Impact of alcohol and density on White Wine Rating:
This graph shows a similar trend to the previous one: the higher the alcohol content, the higher the rating. Density mostly ranges from 0.99 to 1.0.
Let’s switch variable from density to volatile acidity:
Impact of volatile acidity and alcohol on White Wine Rating
The above graph continues the trend that the higher the alcohol %, the higher the rating. It also shows that the higher the volatile acidity level, the more wines are rated ‘bad’, which is consistent with our understanding of volatile acidity.
Impact of citric acidity and alcohol on White Wine Rating
This graph continues the trend that the higher the alcohol %, the higher the rating. It also shows that most citric acid values fall between 0 and 0.5.
Impact of sulphates and alcohol on White Wine Rating
This graph continues with the trend that the higher the alcohol %, the higher the rating.
Impact of chlorides and pH on White Wine Rating
Let’s take a deeper look at the effects of different variables and alcohol on rating:
Alcohol and pH value were evaluated against ‘rating’, which is a coarser grouping of ‘quality’. The scatterplot shows a clear pattern: the ‘bad’ rating wines are concentrated at lower alcohol levels, and the “good” rating wines are more concentrated at higher alcohol levels.
A linear regression was fitted using the ‘lm’ function (alcohol regressed on quality, as the call in the output below shows). However, the R-squared value turned out to be 0.19, which is relatively low. Therefore, additional features need to be included to see whether they improve the R-squared value.
## List of 11
## $ call : language lm(formula = I(alcohol) ~ I(quality), data = wwine)
## $ terms :Classes 'terms', 'formula' length 3 I(alcohol) ~ I(quality)
## .. ..- attr(*, "variables")= language list(I(alcohol), I(quality))
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "I(alcohol)" "I(quality)"
## .. .. .. ..$ : chr "I(quality)"
## .. ..- attr(*, "term.labels")= chr "I(quality)"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(I(alcohol), I(quality))
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "I(alcohol)" "I(quality)"
## $ residuals :Class 'AsIs' Named num [1:4898] -1.788 -1.088 -0.488 -0.688 -0.688 ...
## .. ..- attr(*, "names")= chr [1:4898] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] 6.9567 0.6052 0.1063 0.0179 65.4702 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "I(quality)"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "I(quality)"
## $ sigma : num 1.11
## $ df : int [1:3] 2 4896 2
## $ r.squared : num 0.19
## $ adj.r.squared: num 0.19
## $ fstatistic : Named num [1:3] 1146 1 4896
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 0.0092 -0.00153 -0.00153 0.00026
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "I(quality)"
## .. ..$ : chr [1:2] "(Intercept)" "I(quality)"
## - attr(*, "class")= chr "summary.lm"
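The numbers in this summary can be cross-checked against values printed elsewhere in the report. The sketch below recomputes the first few residuals from the fitted coefficients and the `head()` rows, and uses two facts that hold for a regression with a single predictor: R-squared equals the squared Pearson correlation, and the F statistic equals the squared t statistic of the slope (which matches the t value that cor.test reports for alcohol and quality).

```r
# Coefficients from the summary above: alcohol = intercept + slope * quality
intercept <- 6.9567
slope     <- 0.6052

# First five wines, transcribed from the head() output
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)
quality <- c(6, 6, 6, 6, 6)

# Residual = observed alcohol minus fitted alcohol
fitted_alcohol  <- intercept + slope * quality
residuals_check <- alcohol - fitted_alcohol
round(residuals_check, 3)

# Single-predictor identities: R^2 = r^2 and F = t^2
r         <- 0.4355747   # correlation between alcohol and quality
t_slope   <- 33.858      # t value from the alcohol vs quality cor.test
r_squared <- r^2
f_stat    <- t_slope^2
round(c(r_squared = r_squared, f_stat = f_stat), 2)
```

Because R-squared here is just the squared correlation, swapping the roles of alcohol and quality in the formula would give the same R-squared.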
At the start of the analysis, we determined that quality is the most important element as it affects the prices of white wines. We also know that white wines are assigned different quality scores. But how are the quality scores distributed? We first start with a univariate graph showing the distribution of white wine quality. The next 2 graphs will then show how other features impact the quality score.
We then take a deeper look at the bivariate graph showing the correlation between alcohol level and quality (rating). As the graph shows, there’s a trend of higher alcohol content going with higher quality. Let’s look at a multivariate graph next, showing how multiple features affect the quality (rating).
The last graph is a multivariate graph showing the relationship between density, alcohol and quality (rating) of wine. The trend continues that the higher the alcohol level, the higher the quality (rating) of the wine. As for density, most wines fall between 0.99 and 1.0.
This was a good exercise to learn how to explore and find relationships between different features in a dataset. Before the analysis, I thought features such as volatile.acidity or density would have the most impact on the quality of the wine. But when the cor.test analysis was performed, alcohol showed the highest correlation with quality, at 0.436. The cor.test also shows other interesting relationships, including positive relationships with quality (free sulfur dioxide, pH, sulphates) and negative relationships with quality (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide, density). The cor.test was a very good starting point for the analysis. It provides a direction for what further analysis can be done.
We then tried to confirm this by exploring the relationship with bivariate and multivariate graphs. The various bivariate and multivariate graphs that were produced further supported the positive relationship between alcohol and quality (rating). Different variables were also included in the multivariate plots, and these also showed the trend of higher alcohol level going with higher quality (rating).
There are two tasks I would like to continue working on with this dataset in future analysis:
Explore other positive/negative relationships with quality.
Create models. Currently I have just performed a quick analysis using the lm function. I would like to explore this further by using different algorithms and checking which one has the best prediction accuracy.
Dataset: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
Role of sulfur dioxide: https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/
Sulfur dioxide in winemaking: https://en.wikipedia.org/wiki/Sulfur_dioxide#In_winemaking
Acids in wine: https://en.wikipedia.org/wiki/Acids_in_wine